43 research outputs found

    Mixture Modeling and Outlier Detection in Microarray Data Analysis

    Microarray technology has become a dynamic tool in gene expression analysis because it allows for the simultaneous measurement of thousands of gene expressions. Uniqueness in experimental units and microarray data platforms, coupled with how gene expressions are obtained, make the field open for interesting research questions. In this dissertation, we present our investigations of two independent studies related to microarray data analysis. First, we study a recent platform in biology and bioinformatics that compares the quality of genetic information from exfoliated colonocytes in fecal matter with genetic material from mucosa cells within the colon. Using the intraclass correlation coefficient (ICC) as a measure of reproducibility, we assess the reliability of density estimation obtained from preliminary analysis of fecal and mucosa data sets. Numerical findings clearly show that the distribution is comprised of two components. For measurements between 0 and 1, it is natural to assume that the data points are from a beta-mixture distribution. We explore whether ICC values should be modeled with a beta mixture or transformed first and fit with a normal mixture. We find that the use of mixture of normals in the inverse-probit transformed scale is less sensitive toward model mis-specification; otherwise a biased conclusion could be reached. By using the normal mixture approach to compare the ICC distributions of fecal and mucosa samples, we observe the quality of reproducible genes in fecal array data to be comparable with that in mucosa arrays. For microarray data, within-gene variance estimation is often challenging due to the high frequency of low replication studies. Several methodologies have been developed to strengthen variance terms by borrowing information across genes. However, even with such accommodations, variance may be initiated by the presence of outliers. 
For our second study, we propose a robust modification of optimal shrinkage variance estimation to improve outlier detection. In order to increase power, we suggest grouping standardized data so that information shared across genes is similar in distribution. Simulation studies and analysis of real colon cancer microarray data reveal that our methodology is insensitive to outliers, free of distributional assumptions, effective for small sample sizes, and data-adaptive.

    Evaluation of fecal mRNA reproducibility via a marginal transformed mixture modeling approach

    Background: Developing and evaluating new technology that enables researchers to recover gene-expression levels of colonic cells from fecal samples could be key to a non-invasive screening tool for early detection of colon cancer. The current study, to the best of our knowledge, is the first to investigate and report the reproducibility of fecal microarray data. Using the intraclass correlation coefficient (ICC) as a measure of reproducibility and the preliminary analysis of fecal and mucosal data, we assessed the reliability of mixture density estimation and the reproducibility of fecal microarray data. Using Monte Carlo-based methods, we explored whether ICC values should be modeled as a beta-mixture or transformed first and fitted with a normal-mixture. We used outcomes from bootstrapped goodness-of-fit tests to determine which approach is less sensitive toward potential violation of distributional assumptions.
    Results: The graphical examination of both the distributions of ICC and probit-transformed ICC (PT-ICC) clearly shows that there are two components in the distributions. For ICC measurements, which are between 0 and 1, the practice in the literature has been to assume that the data points are from a beta-mixture distribution. Nevertheless, in our study we show that the use of a normal-mixture modeling approach on PT-ICC could provide superior performance.
    Conclusions: When modeling ICC values of gene expression levels, using a mixture of normals in the probit-transformed (PT) scale is less sensitive toward model mis-specification than using a mixture of betas. We show that a biased conclusion could be made if we follow the traditional approach and model the two sets of ICC values using the mixture of betas directly. The problematic estimation arises from the sensitivity of beta-mixtures toward model mis-specification, particularly when there are observations in the neighborhood of the boundary points, 0 or 1. Since beta-mixture modeling is commonly used in approximating the distribution of measurements between 0 and 1, our findings have important implications beyond the current study. By using the normal-mixture approach on PT-ICC, we observed the quality of reproducible genes in fecal array data to be comparable to that in mucosal arrays.
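
The transform-then-fit idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`probit`, `fit_two_normal_mixture`, `simulate_icc`), the clipping constant `eps`, the median-split initialization, the fixed iteration count, and the simulated ICC values are all hypothetical choices made for the example.

```python
import math
import random
from statistics import NormalDist

def probit(p, eps=1e-6):
    """Standard normal quantile of p. Clipping by eps (an assumption of this
    sketch) keeps values at or near the boundaries 0 and 1 finite."""
    return NormalDist().inv_cdf(min(max(p, eps), 1.0 - eps))

def normal_pdf(v, mu, sd):
    z = (v - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_normal_mixture(x, n_iter=200, min_sd=1e-3):
    """Plain EM for a two-component normal mixture (needs len(x) >= 2).
    Returns (weights, means, sds), ordered by increasing mean."""
    xs = sorted(x)
    n, half = len(xs), len(xs) // 2
    # Initialize the two means from the lower and upper halves of the data.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand = sum(xs) / n
    sd0 = max(math.sqrt(sum((v - grand) ** 2 for v in xs) / n), min_sd)
    sd, w = [sd0, sd0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point is from component 0.
        resp = []
        for v in x:
            d0 = w[0] * normal_pdf(v, mu[0], sd[0])
            d1 = w[1] * normal_pdf(v, mu[1], sd[1])
            resp.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        # M-step: weighted means, standard deviations, and mixing weights.
        n0 = max(sum(resp), 1e-12)
        n1 = max(n - sum(resp), 1e-12)
        mu = [sum(r * v for r, v in zip(resp, x)) / n0,
              sum((1 - r) * v for r, v in zip(resp, x)) / n1]
        var0 = sum(r * (v - mu[0]) ** 2 for r, v in zip(resp, x)) / n0
        var1 = sum((1 - r) * (v - mu[1]) ** 2 for r, v in zip(resp, x)) / n1
        sd = [max(math.sqrt(var0), min_sd), max(math.sqrt(var1), min_sd)]
        w = [n0 / n, n1 / n]
    order = sorted(range(2), key=lambda k: mu[k])
    return [w[k] for k in order], [mu[k] for k in order], [sd[k] for k in order]

def simulate_icc(n_low, n_high, seed=0):
    """Hypothetical ICC values: two normal components on the probit scale,
    mapped back into (0, 1) with the standard normal CDF."""
    rng = random.Random(seed)
    z = [rng.gauss(-0.5, 0.4) for _ in range(n_low)]
    z += [rng.gauss(1.5, 0.3) for _ in range(n_high)]
    return [NormalDist().cdf(v) for v in z]
```

A typical call would be `fit_two_normal_mixture([probit(v) for v in icc_values])`. The point of the transformation is that values near 0 or 1 are moved onto the real line before fitting, which is the region where the abstract reports beta-mixtures to be most sensitive.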

    Multi-class computational evolution: development, benchmark evaluation and application to RNA-Seq biomarker discovery

    Background: A computational evolution system (CES) is a knowledge discovery engine that can identify subtle, synergistic relationships in large datasets. Pareto optimization allows CESs to balance accuracy with model complexity when evolving classifiers. Using Pareto optimization, a CES is able to identify a very small number of features while maintaining high classification accuracy. A CES can be designed for various types of data, and the user can exploit expert knowledge about the classification problem in order to improve discrimination between classes. These characteristics give CES an advantage over other classification and feature selection algorithms, particularly when the goal is to identify a small number of highly relevant, non-redundant biomarkers. Previously, CESs have been developed only for binary class datasets. In this study, we developed a multi-class CES.
    Results: The multi-class CES was compared to three common feature selection and classification algorithms: support vector machine (SVM), random k-nearest neighbor (RKNN), and random forest (RF). The algorithms were evaluated on three distinct multi-class RNA sequencing datasets. The comparison criteria were run-time, classification accuracy, number of selected features, and stability of the selected feature set (as measured by the Tanimoto distance). The performance of each algorithm was data-dependent. CES performed best on the dataset with the smallest sample size, indicating that CES has a unique advantage, since the accuracy of most classification methods suffers when sample size is small.
    Conclusion: The multi-class extension of CES increases the appeal of its application to complex, multi-class datasets in order to identify important biomarkers and features.
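
The stability criterion mentioned above can be illustrated with a short sketch. It assumes the common set-based definition of the Tanimoto distance (one minus the size of the intersection over the size of the union); the abstract does not spell out the exact variant used, and the function names and gene labels here are hypothetical.

```python
from itertools import combinations

def tanimoto_distance(a, b):
    """Set-based Tanimoto (Jaccard) distance: 1 - |A & B| / |A | B|.
    Two empty sets are treated as identical (distance 0.0)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def mean_pairwise_distance(feature_sets):
    """Average Tanimoto distance over all pairs of selected feature sets,
    e.g. one set per resampling run. Lower values mean a more stable selection."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical feature sets chosen in three separate runs.
runs = [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g1", "g2", "g3"}]
stability = mean_pairwise_distance(runs)
```

Averaging over all pairs of runs is one simple way to summarize stability as a single number; other summaries are possible.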

    Dietary Iodine Sufficiency and Moderate Insufficiency in the Lactating Mother and Nursing Infant: A Computational Perspective.

    The Institute of Medicine recommends that lactating women ingest 290 μg iodide/d and a nursing infant, less than two years of age, 110 μg/d. The World Health Organization, United Nations Children's Fund, and International Council for the Control of Iodine Deficiency Disorders recommend population maternal and infant urinary iodide concentrations ≥ 100 μg/L to ensure iodide sufficiency. For breast milk, researchers have proposed that an iodide concentration range of 150-180 μg/L indicates iodide sufficiency for the mother and infant; however, no national or international guidelines exist for breast milk iodine concentration. For the first time, a biologically based model of the lactating woman and nursing infant, from delivery to 90 days postpartum, was constructed to predict maternal and infant urinary iodide concentration, breast milk iodide concentration, the amount of iodide transferred in breast milk to the nursing infant each day, and maternal and infant serum thyroid hormone kinetics. The maternal and infant models each consisted of three sub-models: iodide, thyroxine (T4), and triiodothyronine (T3). Using our model to simulate a maternal intake of 290 μg iodide/d, the average daily amount of iodide ingested by the nursing infant, after 4 days of life, gradually increased from 50 to 101 μg/day over 90 days postpartum. The predicted average lactating mother and infant urinary iodide concentrations were both in excess of 100 μg/L, and the predicted average breast milk iodide concentration was 157 μg/L. The predicted serum thyroid hormones (T4, free T4 (fT4), and T3) in both the nursing infant and lactating mother were indicative of euthyroidism. The model was calibrated using serum thyroid hormone concentrations for lactating women from the United States and was successful in predicting serum T4 and fT4 levels (within a factor of two) for lactating women in other countries. T3 levels were adequately predicted. 
Infant serum thyroid hormone levels were adequately predicted for most data. For moderate iodide deficient conditions, where dietary iodide intake may range from 50 to 150 μg/d for the lactating mother, the model satisfactorily described the iodide measurements, although with some variation, in urine and breast milk. Predictions of serum thyroid hormones in moderately iodide deficient lactating women (50 μg/d) and nursing infants did not closely agree with mean reported serum thyroid hormone levels; however, predictions were usually within a factor of two. Excellent agreement between prediction and observation was obtained for a recent moderate iodide deficiency study in lactating women. Measurements included iodide levels in urine of infant and mother, iodide in breast milk, and serum thyroid hormone levels in infant and mother. A maternal iodide intake of 50 μg/d resulted in a predicted 29-32% reduction in serum T4 and fT4 in nursing infants; however, the reduced serum levels of T4 and fT4 were within most of the published reference intervals for infants. This biologically based model is an important first step toward integrating the rapid changes that occur in the thyroid system of the nursing newborn in order to predict adverse outcomes from exposure to thyroid-acting chemicals, drugs, radioactive materials, or iodine deficiency.

    An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data.

    The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for RNA-seq data are executed within a testing framework and may be sensitive to sparse data and heavy-tailed distributions. Therefore, we propose a univariate algorithm that utilizes a probabilistic approach to measure the deviation between an observation and the distribution generating the remaining data, and we implement it within an iterative leave-one-out design strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.
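
The general shape of an iterative leave-one-out procedure can be sketched as below. This is not the authors' algorithm: the abstract does not give its probabilistic measure, so this sketch swaps in a normal approximation on log-transformed counts as a stand-in for a negative binomial model. The function name `loo_outliers`, the threshold `alpha`, and the example counts are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def loo_outliers(counts, alpha=0.001):
    """Iteratively flag observations that are improbable under a normal
    distribution fitted to log(count + 1) of the remaining observations.
    Stand-in model only; returns the sorted indices of flagged observations."""
    logs = [math.log(c + 1) for c in counts]
    active = list(range(len(counts)))
    flagged = []
    std_normal = NormalDist()
    # Keep at least three observations so the leave-one-out fit stays defined.
    while len(active) > 3:
        best_idx, best_p = None, 1.0
        for i in active:
            rest = [logs[j] for j in active if j != i]
            mu, sd = mean(rest), stdev(rest)
            if sd == 0:
                p = 1.0 if logs[i] == mu else 0.0
            else:
                z = abs(logs[i] - mu) / sd
                p = 2.0 * (1.0 - std_normal.cdf(z))  # two-sided tail probability
            if p < best_p:
                best_idx, best_p = i, p
        # Stop when even the most extreme observation is not improbable enough.
        if best_idx is None or best_p >= alpha:
            break
        flagged.append(best_idx)
        active.remove(best_idx)
    return sorted(flagged)

# Hypothetical read counts for one feature across seven samples.
example_flags = loo_outliers([100, 110, 95, 105, 98, 102, 5000])
```

Removing only the single most extreme observation per pass, then refitting, is what makes the design iterative: one outlier can no longer inflate the spread estimate used to judge the next candidate.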

    Low-Frequency Mutational Heterogeneity of Invasive Ductal Carcinoma Subtypes: Information to Direct Precision Oncology

    Information regarding the role of low-frequency hotspot cancer-driver mutations (CDMs) in breast carcinogenesis and therapeutic response is limited. Using the sensitive and quantitative Allele-specific Competitor Blocker PCR (ACB-PCR) approach, mutant fractions (MFs) of six CDMs (PIK3CA H1047R and E545K, KRAS G12D and G12V, HRAS G12D, and BRAF V600E) were quantified in invasive ductal carcinomas (IDCs; including ~20 samples per subtype). Measurable levels (i.e., ≥ 1 × 10⁻⁵, the lowest ACB-PCR standard employed) of the PIK3CA H1047R, PIK3CA E545K, KRAS G12D, KRAS G12V, HRAS G12D, and BRAF V600E mutations were observed in 34/81 (42%), 29/81 (36%), 51/81 (63%), 9/81 (11%), 70/81 (86%), and 48/81 (59%) of IDCs, respectively. Correlation analysis using available clinicopathological information revealed that PIK3CA H1047R and BRAF V600E MFs correlate positively with maximum tumor dimension. Analysis of IDC subtypes revealed minor mutant subpopulations of critical genes in the MAP kinase pathway (KRAS, HRAS, and BRAF) were prevalent across IDC subtypes. Few triple-negative breast cancers (TNBCs) had appreciable levels of PIK3CA mutation, suggesting that individuals with TNBC may be less responsive to inhibitors of the PI3K/AKT/mTOR pathway. These results suggest that low-frequency hotspot CDMs contribute significantly to the intertumoral and intratumoral genetic heterogeneity of IDCs, which has the potential to impact precision oncology approaches.

    Assessing Sex Differences in the Risk of Cardiovascular Disease and Mortality per Increment in Systolic Blood Pressure: A Systematic Review and Meta-Analysis of Follow-Up Studies in the United States

    In the United States (US), cardiovascular (CV) disease accounts for nearly 20% of national health care expenses. Since costs are expected to increase with the aging population, informative research is necessary to address the growing burden of CV disease and sex-related differences in diagnosis, treatment, and outcomes. Hypertension is a major risk factor for CV disease and mortality. To evaluate whether there are sex-related differences in the effect of systolic blood pressure (SBP) on the risk of CV disease and mortality, we performed a systematic review and meta-analysis. We conducted a comprehensive search using PubMed and Google Scholar to identify US-based studies published prior to 31 December 2015. We identified eight publications for CV disease risk, which provided 9 female and 8 male effect size (ES) observations. We also identified twelve publications for CV mortality, which provided 10 female and 18 male ES estimates. Our meta-analysis estimated that the pooled ES for increased risk of CV disease per 10 mm Hg increment in SBP was 25% for women (95% Confidence Interval (CI): 1.18, 1.32) and 15% for men (95% CI: 1.11, 1.19). The pooled increase in CV mortality per 10 mm Hg SBP increment was similar for both women and men (Women: 1.16; 95% CI: 1.10, 1.23; Men: 1.17; 95% CI: 1.12, 1.22). After adjusting for age and baseline SBP, the results demonstrated that the risk of CV disease per 10 mm Hg SBP increment for women was 1.1-fold higher than for men (P < 0.01; 95% CI: 1.04, 1.17). Heterogeneity was moderate but significant. There was no significant sex difference in CV mortality.
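
As a rough arithmetic check on the sex comparison, the two pooled CV disease estimates can be compared on the log scale. This sketch does not reproduce the authors' adjusted analysis: it reads the pooled effects as ratios of 1.25 and 1.15 (an interpretation of the "25%" and "15%" figures), treats the two estimates as independent, recovers standard errors from the reported 95% CIs, and ignores the adjustment for age and baseline SBP. The function names are hypothetical.

```python
import math

def log_se_from_ci(lower, upper, z=1.96):
    """Standard error on the log scale implied by a 95% CI for a ratio."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)

def ratio_of_effects(es_a, ci_a, es_b, ci_b, z=1.96):
    """Ratio es_a / es_b with an approximate 95% CI, assuming the two
    estimates are independent (an assumption of this sketch)."""
    log_ratio = math.log(es_a) - math.log(es_b)
    se = math.sqrt(log_se_from_ci(*ci_a) ** 2 + log_se_from_ci(*ci_b) ** 2)
    return (math.exp(log_ratio),
            math.exp(log_ratio - z * se),
            math.exp(log_ratio + z * se))

# Pooled CV disease effects per 10 mm Hg SBP increment, as read from the abstract.
ratio, ci_low, ci_high = ratio_of_effects(1.25, (1.18, 1.32), 1.15, (1.11, 1.19))
```

The unadjusted ratio is 1.25 / 1.15, which rounds to about 1.09. That is in the same direction as the adjusted 1.1-fold difference reported above, but the two figures come from different analyses and should not be treated as interchangeable.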

    Sex-specific and overall effect sizes (ES) for CV mortality per 10 mm Hg increment in SBP.

    ES observations are ordered by baseline SBP values. The corresponding ES IDs are listed in Table 2 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0170218#pone.0170218.t002).

    Number of features with 0 through 4 detected outliers in the control group of rat RNA-seq data.
